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Abstract 

Fitting high-dimensional data involves a delicate tradeoff between faithful representation and 
the use of sparse models. Too often, sparsity assumptions on the fitted model are too restrictive 
to provide a faithful representation of the observed data. In this paper, we present a novel 
framework incorporating sparsity in different domains. We decompose the observed covariance 
matrix into a sparse Gaussian Markov model (with a sparse precision matrix) and a sparse 
independence model (with a sparse covariance matrix). Our framework incorporates sparse 
covariance and sparse precision estimation as special cases and thus introduces a richer class 
of high-dimensional models. We characterize sufficient conditions for identifiability of the two 
models, viz., Markov and independence models. We propose an efficient decomposition method 
based on a modification of the popular ^-penalized maximum- likelihood estimator (fj-MLE). 
We establish that our estimator is consistent in both the domains, i.e., it successfully recovers 
the supports of both Markov and independence models, when the number of samples n scales 
as n = £l(d 2 logp), where p is the number of variables and d is the maximum node degree in the 
Markov model. Our experiments validate these results and also demonstrate that our models 
have better inference accuracy under simple algorithms such as loopy belief propagation. 

Keywords: High-dimensional covariance estimation, sparse graphical model selection, sparse co- 
variance models, sparsistency, convex optimization. 

1 Introduction 

Covariance estimation is a classical problem in multi-variate statistics. The idea that second-order 
statistics capture important and relevant relationships between a given set of variables is natural. 
Finding the sample covariance matrix based on observed data is straightforward and widely used pQ . 
However, the sample covariance matrix is ill-behaved in high-dimensions, where the number of 
dimensions p is typically much larger than the number of available samples n (p>n). Here, the 
problem of covariance estimation is ill-posed since the number of unknown parameters is larger 
than the number of available samples, and the sample covariance matrix becomes singular in this 
regime. 

*M. Janzamin and A. Anandkumar are with the Center for Pervasive Communications and Computing, Electrical 
Engineering and Computer Science Dept., University of California, Irvine, USA 92697. Email: mjanzami@uci.edu, 
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Various solutions have been proposed for high-dimensional covariance estimation. Intuitively, by 
restricting the class of covariance models to those with a limited number of free parameters, we 
can successfully estimate the models in high dimensions. A natural mechanism to achieve this is to 
impose a sparsity constraint on the covariance matrix. In other words, it is presumed that there are 
only a few (off-diagonal) non-zero entries in the covariance matrix, which implies that the variables 
under consideration approximately satisfy marginal independence, corresponding to the zero pattern 
of the covariance matrix [2j (and we refer to such models as independence models). Many works 
have studied this setting and have provided guarantees for high-dimensional estimation through 
simple thresholding of the sample covariance matrix and other related schemes. See Section Tl. 11 In 
many settings, however, marginal independence is too restrictive and does not hold. For instance, 
consider the dependence between the monthly stock returns of various companies listed on the S&P 
100 index. It is quite possible that a wide range of complex (and unobserved) factors such as the 
economic climate, interest rates etc., affect the returns of all the companies. Thus, it is not realistic 
to model the stock returns of various companies through a sparse covariance model. 

A popular alternative sparse model, based on conditional independence relationships, has gained 
widespread acceptance in recent years [3]. In this case, sparsity is imposed not on the covari- 
ance matrix, but on the inverse covariance or the precision matrix. It can be shown that the 
zero pattern of the precision matrix corresponds to a set of conditional-independence relationships 
and such models are referred to as graphical or Markov models. Going back to the stock market 
example, a first-order approximation is to model the companies in different divisions,] as condition- 
ally independent given the S&P 100 index variable, which captures the overall trends of the stock 
returns, and thus removes much of the dependence between the companies in different divisions. 
High-dimensional estimation in models with sparse precision matrices has been widely studied, and 
guarantees for estimation have been provided under a set of sufficient conditions. See Section 11.11 
for related works. However, sparse Markov models may not be always sufficient to capture all the 
relationships between the variables. Going back to the stock market example, the approximation of 
using the S&P index node to capture the dependence between companies of different divisions may 
not be enough. For instance, there can still be a large residual dependence between the companies 
in manufacturing and mining divisions, which cannot be accounted by the S&P index node. 

In this paper, we consider decomposition of the observed data into two domains, viz., Markov 
and independence domains. We posit that the observed data results in a sparse graphical model 
under structured perturbations in the form of an independence model, see FigdJ This framework 
encapsulates Markov and independence models, and incorporates a richer class of models which 
can faithfully capture complex relationships, such as in the stock market example above, and yet 
retain parsimonious representation. 

Summary of Contributions 

We consider joint estimation of Markov and independence models, given observed data in a high 
dimensional setting. Our contributions in this paper are three fold. First, we derive a set of 
sufficient restrictions, under which there is a unique decomposition into the two domains, viz., 
the Markov and the independence domains, thereby leading to an identifiable model. Second, we 

^ee http://www.osha.gov/pls/imis/sic_manual.html for classifications of the companies. 
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Figure 1: Representation of the covariance decomposition problem, where perturbing the observed covari- 
ance matrix with a structured noise model results in a sparse graphical model. The case where the noise 
model has sparse marginal dependencies is considered. 

propose novel and efficient estimators for obtaining the decomposition, under both exact and sample 
statistics. Third, we provide strong theoretical guarantees for high-dimensional learning, both in 
terms of norm guarantees and sparsistency in each domain, viz., the Markov and the independence 
domain. 

Our learning method is based on convex optimization. We adapt the popular ^i-penalized maximum 
likelihood estimator (MLE), proposed originally for sparse Markov model selection and has efficient 
implementation in the form of graphical lasso [3]. This method involves an l\ penalty on the 
precision matrix, which is a convex relaxation of the £o penalty, in order to encourage sparsity in 
the precision matrix. The Lagrangian dual of this program is a maximum entropy solution which 
approximately fits the given sample covariance matrix. We modify this program to our setting 
as follows: we incorporate an additional i\ penalty term involving the residual covariance matrix 
(corresponding to the independence model) in the max-entropy program. This term can be viewed 
as encouraging sparsity in the independence domain, while fitting a maximum entropy Markov 
model to the rest of the sample correlations. We characterize the optimal solution of the above 
program, and also provide intuitions on the class of Markov and independence model combinations 
which can be incorporated under this framework. As a byproduct of this analysis, we obtain a set 
of conditions for identifiability of the two model components. 

We provide strong theoretical guarantees for our proposed method under a set of sufficient con- 
ditions. We establish that it is possible to obtain sparsistency and norm guarantees in both the 
Markov and the independence domains. We establish that the number of samples n is required to 
scale as n = S7(d 2 logp) for consistency, where p is the number of variables, and d is the maximum 
degree in the Markov graph. The set of sufficient conditions for successful recovery are based on 
the so-called notion of mutual incoherence, which controls the dependence between different sets of 
variables [5]. 

We establish that our estimator reduces to sparse covariance and sparse precision estimation for a 
given choice of a certain tuning parameter. On one end, it reduces to the l\ penalized MLE for 
sparse precision estimation [5]. On the other extreme, it reduces to (soft) threshold estimator for 
sparse covariance estimator, on lines of [6j. Moreover, our conditions for successful recovery are 
similar to those previously characterized for consistent estimation of sparse covariance/precision 
matrix. 

The idea that a combination of Markov and independence models can provide good model-fitting 
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is not by itself new, see [7]. However, the previous approach has several deficiencies, including lack 
of theoretical guarantees, assumption of a known sparsity support for the Markov model, use of 
expectation maximization (EM) which has convergence issues, and so on. See Section \1 . II below for 
a detailed discussion. In contrast, we develop convex optimization methods for decomposition, and 
also provide theoretical guarantees for successful recovery. In summary, in this paper, we provide an 
in-depth study of efficient methods and guarantees for joint estimation of a combination of Markov 
and independence models. 

Our experiments validate our theoretical results and demonstrate that our method is able to learn a 
richer class of models, compared to sparse graphical model selection, while requiring similar number 
of samples. In particular, our method is able to provide better estimates for the overall precision 
matrix, which is dense in general, while the performance of ^i-based optimization is worse since it 
attempts to approximate the dense matrix via a sparse estimate. Additionally, we demonstrate that 
our estimated models have better accuracy under simple distributed inference algorithms such as 
loopy belief propagation (LBP). This is because the Markov components of the estimated models 
tend to be more walk summable [8], since some of the correlations can be "transferred" to the 
residual matrix. Thus, in addition to learning a richer model class, incorporating sparsity in both 
covariance and precision domains, we also learn models amenable to efficient inference. 

1.1 Related Works 

There have been numerous works on high-dimensional covariance selection and estimation, and we 
describe them below. In all the settings below based on sparsity of the covariance matrix in some 
basis, the notion of consistent estimation of the sparse support is known as sparsistency. 

Sparse Graphical Models: Estimation of covariance matrices by exploiting the sparsity pattern 
in the inverse covariance or the precision matrix has a long history. The sparsity pattern of the 
precision matrix corresponds to a Markov graph of a graphical model which characterizes the set of 
conditional independence relationships between the variables. Chow and Liu established that the 
maximum likelihood estimate (MLE) for tree graphical models reduces to a maximum weighted 
spanning tree algorithm where the edge weights correspond to empirical mutual information. The 
seminal work by Dempster [9] on covariance selection over chordal graphs analyzed the convex 
program corresponding to the Gaussian MLE and its dual, when the graph structure is known. 

In the high-dimensional regime, penalized likelihood methods have been used in a number of works 
to achieve parsimony in covariance selection. Penalized MLE based on i\ penalty has been used 
in OHUHH], among numerous other works, where sparsistency and norm guarantees for recovery 
in high dimensions are provided. Graphical lasso [1] is an efficient and popular implementation for 
the ^i-MLE. There have also been recent extensions to group sparsity structures |15U16| . scenarios 
with missing samples |17yi8j. semi-parametric settings based on non-paranormals [19], and to the 
non-parametric setting |20j . In addition to the convex methods, there have also been a number of 
non-convex methods for Gaussian graphical model selection |21H25j . While we base much of our 
consistency analysis on [5] , we also need to develop novel techniques to handle the delicate issue of 
errors in the two domains, viz., Markov and independence domains. 
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Sparse Covariance Matrices: In contrast to the above formulation, alternatively we can impose 
sparsity on the covariance matrix. Note that the zero pattern in the covariance matrix corresponds 
to marginal independence relationships [2,26,27]. High-dimensional estimation of sparse covariance 
models has been extensively studied in [6,28,29], among others. Wagaman and Levina |30| con- 
sider block-diagonal and banded covariance matrices and propose an Isomap method for discovering 
meaningful orderings of variables. The work in [31] provides unified results for sparsistency under 
different sparsity assumptions, viz., sparsity in precision matrices, covariance matrices and models 
with sparse Cholesky decomposition. 

The above works provide strong guarantees for covariance selection and estimation under various 
sparsity assumptions. However, they cannot handle matrices which are combinations of different 
sparse representations, but are otherwise dense when restricted to any single representation. 

Decomposable Regular izers: Recent works have considered model decomposition based on 
observed samples into desired parts through convex relaxation approaches. Typically, each part is 
represented as an algebraic variety, which are based on semi- algebraic sets, and conditions for re- 
covery of each component are characterized. For instance, decomposition of the inverse covariance 
matrix into sparse and low-rank varieties is considered in [32-34] and is relevant for latent Gaussian 
graphical model. The work in [35] considers finding a sparse-approximation using a small number 
of positive semi-definite (PSD) matrices, where the "basis" or the set of PSD matrices is specified 
a priori. In [36], a unified framework is provided for high-dimensional analysis of the so-called 
M-estimators, which optimize the sum of a convex loss function with decomposable regularizers. 
A general framework for decomposition into a specified set of algebraic varieties was studied in [37] . 

The above formulations, however, cannot incorporate our scenario, which consists of a combi- 
nation of sparse Markov and independence graphs. This is because, although the constraints on 
the inverse covariance matrix (Markov graph) and the covariance matrix (independence graph) can 
each be specified in a straightforward manner, their combined constraints on the resulting covari- 
ance matrix is not easy to incorporate into a learning method. In particular, we do not have a 
decomposable regularizer for this setting. 

Multi- Resolution Models: Perhaps the work which is closest to ours is [7J, where multi- 
resolution models with a known hierarchy of variables is considered. The model consists of a combi- 
nation of a sparse precision matrix, which captures the conditional independence across scales, and 
a sparse covariance matrix, which captures the residual in-scale correlations. Heuristics for learning 
and inference are provided. However, the work has three main deficiencies: the sparsity support is 
assumed to be known, the proposed heuristics have no theoretical guarantees for success and the 
models considered are in general not identifiable, due to the presence of both latent variables and 
residual correlations. 
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2 Preliminaries and Problem Statement 



Notation: For any vector v E MP and a real number a G [1, oo), the notation ||u|| refers to the 

I 

4t norm of vector v given by ||v|| a := (X^=i l' y i| a ) °- For any matrix U € MP xp , the induced or the 
operator norm is given by |||£7||| a6 := maxiuii = i ||J7.z||& for parameters a, b € [l,oo). Specifically, 
we use the ioo operator norm which is equivalent to 1^1^ = max^x,...^ X^j=i Wij\- We also have 
III ^ 111 = ll^' T |loo- Another induced norm is the spectral norm \U\ 2 ( or III ^ III) which is equivalent 
to the maximum singular value of U. We also use the element -wise norm notation ||t/||oo 
to refer to the maximum absolute value of the entries of U. Note that it is not a matrix norm 
but a norm on the vectorized form of the matrix. The trace inner product of two matrices is 
denoted by (U,V) := Tr(U T V) = J2i,j U ij v ij- Finally, we use the usual notation for asymptotics: 
f(n) = £L(g(n)) if f(n) > cg(n) for some constant c > and f(n) = 0(g(n)) if f(n) < c'g(n) for 
some constant c' < oo. 



2.1 Gaussian Graphical Models 

A Gaussian graphical model is a family of jointly Gaussian distributions which factor in accordance 
to a given graph. Given a graph G = (V, E), with V = {1, . . . ,p}, consider a vector of Gaussian 
random variables X = [X\ , X2 , . . . , X p ] , where each node i G V is associated with a scalar Gaussian 
random variable Xi. A Gaussian graphical model Markov on G has a probability density function 
(pdf) that may be parameterized as 



/x(x) cx exp 



-x T Jx + h T x 
2 



(1) 



where J is a positive-definite symmetric matrix whose sparsity pattern corresponds to that of the 
graph G. More precisely, 

J(i,j)=0 <=> {i,j)£G. (2) 

The matrix J is known as the potential or concentration matrix, the non-zero entries J(i,j) as the 
edge potentials, and the vector h as the potential vector. The form of parameterization in ([T|) is 
known as the information form and is related to the standard mean-covariance parameterization of 
the Gaussian distribution as 

/i = J _1 h, S = J" 1 , 

where /x := E[X] is the mean vector and S := E[(X — /a)(X — n) T ] is the covariance matrix. 

We say that a jointly Gaussian random vector X with joint pdf /(x) satisfies local Markov property 
with respect to a graph G if 

f(x i \x Af{i )) = f(x i \x v \ i ) (3) 

holds for all nodes i € V, where M(i) denotes the set of neighbors of node i G V and, V\i denotes 
the set of all nodes excluding i. More generally, we say that X satisfies the global Markov property, 
if for all disjoint sets A,BcV, we have 

/(xa,xs|x s ) = /(xa|x s )/(x b |x 5 ). (4) 
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where set S is a separator^] of A and B. The local and global Markov properties are equivalent for 
non-degenerate Gaussian distributions [3]. 

On lines of the above description of graphical models, consider the class of Gaussian model^l 
M(fi, Sg c ), where the covariance matrix is supported on a graph G c (henceforth referred to as the 
conjugate graph), i.e., 

X Gc (i,j) = = (i,j)£G c . (5) 
Recall that uncorrelated Gaussian variables are independent, and thus, 

Xi±Xj = (i,j)tG e . (6) 

Equivalence between pairwise independence and global Markov properties were studied in [21 1261 
[27]. 

In this paper, we posit that the observed model results in a sparse graphical model under structure 
perturbations in the form of an independence model: 

X* + E* R = J* M -\ Supp(J^) = G M , Supp(S^) = G R , (7) 

where Supp(-) denotes the set of non-zero (off-diagonal) entries, Gm denotes the Markov graph and 
Gr, the independence graph. 

2.2 Problem Statement 

We now give a detailed description of our problem statement, which consists of the covariance 
decomposition problem (given exact statistics) and covariance estimation problem (given a set of 
samples). 

Covariance Decomposition Problem: A fundamental question to be addressed is the identifi- 
ability of the model parameters. 

Definition 1 (Identifiability) . A parametric model {Pg : € 0} is identifiable with respect to 
a measure ^ if there do not exist two distinct parameters Q\ ^ 62 such that Pg ± = Pg 2 almost 
everywhere with respect to /i. 

Thus, if a model is not identifiable, there is no hope of estimating the model parameters from 
observed data. A Gaussian graphical model (with no hidden variables) belongs to the family of 
standard exponential distributions [381 Ch. 3]. Under non-degeneracy conditions, it is also in the 
minimal form, and as such is identifiable [39J. In our setting in ([7]), however, identifiability is not 
straightforward to address, and forms an important component of the covariance decomposition 
problem, described below. 

Decomposition Problem: Given the covariance matrix S* = J^~ l — £Jj as in ([7]), where 
is an unknown concentration matrix and £Jj is an unknown residual covariance matrix, how and 
under what conditions can we uniquely recover J* M and Yf R from X*? 

2 A set S C V is a separator for sets A and B if the removal of nodes in S partitions A and B into distinct 
components. 

3 In the sequel, we denote the Markov graph, corresponding the support of the information matrix, as G and the 
conjugate graph, corresponding to the support of the covariance matrix, as G c . 
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In other words, we want to address whether the matrices and T,* R are identifiable, given X*, 
and if so, how can we design efficient methods to recover them. If we do not impose any additional 
restrictions, there exists an equivalence class of models which form solutions to the decomposition 
problem. For instance, we can model X* entirely through an independence model (X* = T,* R ), or 
through a Markov model (X* = JjU )■ However, in most scenarios, these extreme cases are not 
desirable, since they result in dense models, while we are interested in sparse representations with 
a parsimonious use of edges in both the graphs, viz., the Markov and the independence graphs. 
In Section 13. 1\ we provide a sufficient set of structural and parametric conditions to guarantee 
identifiability of the Markov and the independence components, and in Section [3.21 we propose an 
optimization program to obtain them. 

Covariance Estimation Problem: In the above decomposition problem, we assume that the 
exact covariance matrix X* is known. However, in practice, we only have access to samples, and 
we describe this setting below. 

Denote X n as the sample covariance matri^ 

1 n 

k=l 

where xnA,k = l,...,n are n i.i.d. observations of a zero mean Gaussian random vector X ~ 
AA(0, X*), where X := {X\, ...,X p ). Now the estimation problem is described below. 

Estimation Problem: Assume that there exists a unique decomposition X* = — X^ where 

is an unknown concentration matrix with bounded entries and XJj is an unknown sparse residual 
covariance matrix given a set of constraints. Given the sample covariance matrix X n , our goal is 
to find estimates of and XJj with provable guarantees. 

In the sequel, we relate the exact and the sample versions of the decomposition problem. In 
Section [H we propose a modified optimization program to obtain efficient estimates of the Markov 
and independence components. Under a set of sufficient conditions, we provide guarantees in terms 
of sparsistency, sign consistency, and norm guarantees, defined below. 

Definition 2 (Estimation Guarantees). We say that an estimate (Jm,^r) to the decomposition 
problem in given a sample covariance matrix X n , is sparsistent or model consistent, if the 
supports of Jm and X# coincide with the supports of and X|j respectively. It is said to be sign 
consistent, if additionally, the respective signs coincide. The norm guarantees on the estimates is 
in terms of bounds on \\Jm — Jm\\ an d ||X_r — X|j||, under some norm \\-\\. 



3 Analysis under Exact Statistics 

In this section, we provide the results under exact statistics. 

4 Without loss of generality, we limit our analysis to zero-mean Gaussian models. The results can be easily generalized to 
models with non-zero means. 
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3.1 Conditions for Unique Decomposition 

We first provide a set of sufficient conditions under which we can guarantee that the decomposition 
of E* in ([7|) into concentration matrix Jjj^ and residual matrix EJj is uniqu^l We impose the 
following set of constraints on the two matrices: 

(A.O) E* and Jj^ are positive definite matrices, i.e., E* y 0, J| f >- 0. 

(A.l) Off-diagonal entries of are bounded from above, i.e., ||J^||oo,off < A*, for some A* > 0. 
(A. 2) Diagonal entries of E^ are zero: (T,*^).. = 0, and the support of its off-diagonal entries satisfies 

(E^..^0 <=> |(J^)..| = A*, Vi^j. (9) 

(A. 3) For any we have sign((E^) .) . sign((j]^) i .) > 0, i.e, the signs are the same. 

Indeed, the above constraints restrict the class of models for which we can provide guarantees. 
However, in many scenarios, the above assumptions may be reasonable, and we now provide some 
justifications. (A.0) is a natural assumption to impose since we are interested in valid E* and 
Jjm matrices. Condition (A.l) corresponds to bounded off-diagonal entries of Jjj^. Intuitively, 
this limits the extent of "dependence" between the variables in the Markov model, and can lead 
to models where inference can be performed with good accuracy using simple algorithms such as 
belief propagation. Condition (A. 2) limits the support of the residual matrix T, R : the residual 
covariances are captured at those locations (edges) where the concentration entries (J^ij are 
"clipped" (i.e., the bound A* is achieved). Intuitively, the Markov matrix is unable to capture 
all the correlations between the node pairs due to clipping, and the residual matrix E|j captures the 
remaining correlations at the clipped locations. Condition (A. 3) additionally characterizes the signs 
of the entries of EJj. For the special case, when the Markov model is attractive, i.e. {J%i)i.j < for 
i 7^ j, the residual entries (E^)jj are also all negative. This implies that the model corresponding 
to E* is also attractive, since it only consists of positive correlations. By default, we set the diagonal 
entries of the residual matrix to zero in (A. 2) and thus, assume that the Markov matrix captures 
all the variances in the model. In Section 14.2.11 we provide a simple example of a Markov chain 
and a residual covariance model satisfying the above conditions. 

It is also worth mentioning that the number of model parameters satisfying above conditions is 
equivalent to the number of parameters in the special case of sparse inverse covariance estimation 
when A — >■ oo |5j. It is assumed in assumption (A. 2) that the residual matrix E^ takes nonzero 
value when the corresponding entry in the Markov matrix takes its maximum absolute value A*. 
This assumption in conjunction with the sign assumption in (A. 3), exactly determines the Markov 
entry (Jm) - when the corresponding residual entry (E#) . . ^ 0. So, for each (i,j) pair, only one 
of the entries (Jm)^ and (Efl)^. are unknown which results that the proposed model in this paper 
does not introduce additional parameters comparing to the sparse inverse covariance estimation, 
which is interesting. 

We drop the positive definite constraint on the residual matrix EJj thereby allowing for a richer class of covariance 
decomposition. In section 15.31 we modify the conditions and the learning method to incorporate positive definite 
residual matrices E^j. 
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According to the above discussion, we observe that the overall covariance and inverse covariance 
matrices E* and J* = E*" 1 are dense, but represented with small number of parameters. It is 
interesting that we are able to represent models with dense patterns, but it is important to notice 
that the sparse representation leads to some restrictions on the model. 

In the sequel, we propose an efficient method to recover the respective matrices J M and EJj under 
conditions (A.0)-(A.3) and then establish the uniqueness of the decomposition. Finally, note that we 
do not impose any sparsity constraints on the concentration matrix J M , and in fact, our method 
and guarantees allow for dense matrices J M , when the exact covariance matrix E* is available. 
However, when only samples are available, we limit ourselves to sparse JL and provide learning 
guarantees in the high-dimensional regime, where the number of samples can be much smaller than 
the number of variables. 



3.2 Formulation of the Optimization Program 

We now propose a method based on convex optimization for obtaining ( J M , E|j) given the covariance 
matrix E* in ([7]). Consider the following program 

(£a/,£r) := argmax logdet Em- A||Eii|| 1)0 flE (10) 

s. t. E M - Er = S*, (Efl)d = 0, 

where ||-|| lo ff denotes the t\ norm of the off-diagonal entries, which is the sum of the absolute 
values of the off-diagonal entries, and (-)d denotes the diagonal entries. Intuitively, the parameter 
A imposes a penalty on large residual covariances, and under favorable conditions, can encourage 
sparsity in the residual matrix. The program in (|10p can be recast 

(E M , Eft) := arg max log det E M (11) 

s. t. Em - Er = E*, (Eij)^ = 0, ||Eii||i j0 fr < C(A), 

for some constant C(A) depending on A. The objective function in the above program corresponds 
to the entropy of the Markov model (modulo a scaling and a shift factor) [40J, and thus, intuitively, 
the above program looks for the optimal Markov model with maximum entropy subject to an l\ 
constraint on the residual matrix. 

We declare the optimal solution E^ in (TlOl) as the estimate of the residual matrix Y** R , and Jm := E M 
as the estimate of the Markov concentration matrix J M . The justification behind these estimates 
is based on the fact that the Lagrangian dual of the program in (I10p is (see Appendix |A|) 

J M ■= arg min(E* ,J M ) - log det J M (12) 
s. t. || Jm Hoo.off < A 

where W'W^ a g denotes the element-wise norm of the off-diagonal entries, which is the maximum 
absolute value of the off-diagonal entries. Further, we show in Appendix |A] that the following 
relations exist between the optimal primal solution Jm and the optimal dual solution (Em,E#): 

Jm = Em , and thus, J^ 1 — Sjj ; = E* is a valid decomposition of the covariance matrix E*. 

6 Henceforth, we refer to the program in (|12fl as the primal program and the program in (|10|) as the dual program. 
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Remark: Notice that when the constraint is removed in the primal program in (|12p . which 
is equivalent to letting A — > oo, the program corresponds to the maximum likelihood estimate, 
and the optimal solution in this case is Jm = Similarly, in the dual program in (jlOp . when 

A — > oo, the optimal solution corresponds to T,m = and X# = 0. At the other extreme, when 
A — > 0, Jm is a diagonal matrix, and the residual matrix Sjj is in general, a full matrix (except 
for the diagonal entries). Thus, the parameter A allows us to carefully tune the contributions of 
the Markov and residual components, and we notice in our experiments in Section [7] that A plays 
a crucial role in obtaining efficient decomposition into Markov and residual components. 

3.3 Guarantees and main results 

We now establish that the optimal solutions of the proposed optimization programs in fllQ[> and 
(|12p lead to a unique decomposition of the given covariance matrix X* under conditions (A.0)-(A.3) 
given in Section 13.11 

Theorem 1 (Uniqueness of Decomposition). Under (A.0)-(A.3), given a covariance matrix £*, if 
we set the parameter A = || ./L-||oo off i> n ^ e optimization program in (jlOp . then the optimal solutions 
of primal- dual optimization programs (|12p and (1101) are given by (Jm^r) = (Jm^r)> an< ^ 
decomposition is unique. 

Proof: See Appendix [C] □ 

Thus, we establish that the proposed optimization programs in (I10p and (|12p uniquely recover the 
Markov concentration matrix J M and the residual covariance matrix E^ given under conditions 



In this section, we provide the results under sample statistics where some i.i.d. samples of random 
variables are only available. 

4.1 Optimization Program 

We have so far provided guarantees on unique decomposition given the exact covariance matrix £*. 
We now consider the case, when n i.i.d. samples are available from JV"(0, £*), which allows us to 
estimate the sample covariance matrix S n , as in (J8|). 

We now modify the dual program in (|10p . considered in the previous section, to incorporate the 
sample covariance matrix E n as follows 



(A.0)-(A.3). 



4 Sample Analysis of the Algorithm 




(13) 



s.t. ||S n - 



Sm + £.R||oo,off < 1: 




T,m >- 0, Em — £r >- 0. 
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Note that, in addition to substituting X* by X n , there are two more modifications in the above 
program comparing to the exact case in (jlpp . First, the positive-definiteness constraint on the 
overall covariance matrix £ = £m — ^r is added to make sure that the overall covariance matrix 
estimation is valid. This constraint is not required in the exact case since we have the constraint 
£ = S* in that case which ensures the positive-definiteness of overall covariance matrix according 
to assumption (A.O) that S* >- 0. Second, the equality constraint £ a/ — £# = £* is relaxed on 
the off-diagonal entries by introducing the new parameter 7 which allows some deviation. More 
discussion including the Lagrangian primal form of the above optimization program and the effect 
of new parameter 7 is provided in section [6l 

4.2 Assumptions under Sample Statistics 

We now provide conditions under which we can provide guarantees for estimating the Markov 
model J M and the residual model T, R , given the sample covariance £ ra in high dimensions. These 
are conditions in addition to conditions (A.0)-(A.3) in Section [3.11 

The additional assumptions for successful recovery in high dimensions are based on the Hessian 
of the objective function in the optimization program in (j39j) . with respect to the variable Jm, 
evaluated at the true Markov model J M . The Hessian of this function is given by |41j 

r* = r M - 1 ®r M - 1 = x* M ®x* M , (14) 

where ® denotes the Kronecker matrix product [42]. Thus F* is a p 2 x p 2 matrix indexed by the 
node pairs. Based on the results for exponential families [39], „ = Cov{XiXj, X^Xi}, and 

hence it can be interpreted as an edge-based alternative to the usual covariance matrix £tp Define 
Km as the £oo operator norm of the covariance matrix of the Markov model 

Km ■= IXmIL. (15) 

We now denote the supports of the Markov and residual models. Denote Em '■= {{hj) £ V x V\i 7^ 
j, {Jhi)ij 7^ 0} as the edge set of Markov matrix J M . Define 

S M ■= E M U = 1, —,p}, (16) 

S R :={(i,j)eVxV\{X R )..^0}. (17) 

Thus, the set Sm includes diagonal entries and also all edges of Markov graph corresponding to 
J M . Also, recall from (A. 2) in Section 13.11 that the diagonal entries of are set to zero, and 
that the support set Sr is contained in Sm, i-e., Sr C Sm- Let S C M and S C R denote the respective 
complement sets. Define 

S := S M n S C R , (18) 

so that {Sr, S, S m } forms a partition of {(1, x This partitioning plays a crucial 

role in being able to provide learning guarantees. Define the maximum node degree for Markov 
model J M as 

d:= max \{i : E S M }\- (19) 
j=i, ■■■,?> 

Finally, for any two subsets T and T 1 of V x V, Ttp T , denotes the submatrix of V* indexed by T as 
rows and T' as columns. We now impose various constraints on the submatrices of the Hessian in 
(H3|), limited to each of the sets {Sr, S, S m }. 
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(A. 4) Mutual Incoherence: These conditions impose mutual incoherence among three partitions 
of T* indexed by Sr, S c m and S. 

max{|r^ fS (r^) - l F* SSR - T^gJ^, \m iiS {T* ss y l \U < (1 - a) for some a E (0, 1], 

(20) 

1 

4' 



^:=l(ry- 1 r^j| 00 < 7 . (2i) 



(A. 5) Covariance Control: For the same a specified above, we have the bound: 

K SS := KTS.)-'!. < 4(ro ( !' ( ;^ 1)a) fa s„ m e ,„ > 4. (22) 

(A. 6) Eigenvalue Control: The minimum eigenvalue of overall covariance matrix E* satisfies the 
lower bound 



A mi n(£*) > C 6 d\ / log ( 4pT ) + C7d 2 lQ g( 4 P r ) for gome c ^ Ci > o and r > 2. (23) 
V n n 

In (A. 4), the condition in (|20p bounds the effect of the non-edges of the Markov model, indexed by 
S M , to its edges, indexed by Sr and S. Note that we distinguish between the common edges of 
the Markov model with the residual model (Sr) and the remaining edges of the Markov model (£*). 
The second condition in (|21|) controls the influence of the edge-based terms which are shared with 
the residual matrix, indexed by Sr, to other edges of the Markov model, indexed by S = Sm D Sr. 
Condition (A. 5) imposes too bounds on the rows of (r^)~ . Note that for sufficiently large m, the 
bound in ([22]) tends to ^ . Also note that the conditions (A. 4) and (A. 5) are only imposed on 
the Markov model J M and there are no additional constraints on the residual matrix T,* R (other 
than the conditions previously introduced in Section [3 . 1 j) . In condition (A. 6), it is assumed that 
the minimum eigenvalue of overall covariance matrix S* is sufficiently far from zero to make sure 
that its estimation £ is positive definite and therefore a valid covariance matrix. 



4.2.1 Example of a Markov Chain + Residual Covariance Model 

In this section, we propose a simple model satisfying assumptions (A.0)-(A.5). Consider a Markov 
chain with concentration matrix J M over 4 nodes, as shown in Figj2j The diagonal entries in the 
corresponding covariance matrix Y,* M = J£f _1 are set to unity, and the correlations between the 
neighbors in are set uniformly to some value p € (—1, 1), i.e., (S^)^. = p for (i, j) G Em- Due to 

the Markov property, the correlations between other node pairs are given by (S^) 13 = (S^) 24 = p 2 
and (S^ f ) 14 = p 3 . For the residual covariance matrix £j^' we consider one edge between nodes 1 
and 2, i.e., Sr = {(1,2), (2, 1)}. It is easy to see that conditions (A.0)-(A.2) are satisfied. Recall 
that S C M = ■ ^ Em} and the remaining node pairs belongs to set S := Sm\Sr. Through 

some straightforward calculations, we can show that for any \p\ < 0.07, the mutual incoherence 
conditions in (A. 4) and (A. 5) are satisfied for a = 0.855 and m > 83. Note that the value of 
nonzero entries of are not involved or restricted by these assumptions. However, they do 
need to satisfy the sign condition in (A. 3). Thus, we have non-trivial models satisfying the set of 
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Figure 2: Example of a Markov chain and a residual covariance matrix, where a residual edge is present 
between nodes 1 and 2. 

sufficient conditions for successful high-dimensional estimation. In Section we observe that our 
method has good numerical estimation performance even when the above conditions are not fully 
satisfied. 



4.3 Guarantees and Main Results 

We are now ready to provide the main result of this paper. 

Theorem 2. Consider a Gaussian distribution with covariance matrix S* = J%j~ l — T,* R satisfying 
conditions (A.0)-(A.6). Given a sample covariance matrix T, n using n i.i.d. samples from the 
Gaussian model, let (^Jm,^r) denote the optimal solutions of the primal-dual pair (|39[) and (|13|) . 
with parameters 7 = C\ y^ogp/ra and A = A* + C^-y/logp/n for some constants C\,Ci > 0, where 
x * ■= ll J Mlloo,off- Suppose that {^* R ) min := min (iJ)eSil |(£*) ..\ scales as (S|?) min = fi(0ogp/ra) 
and the sample size n is lower bounded as 

n = Q(d 2 logp), (24) 

then with probability greater than 1 — 1 jp c — > 1 (for some c > 0), we have: 

a) The estimates Jm >- and T,r satisfy bounds 

\\Jm-JUoo = o(J^\ (25) 
\\% R -T,\U = o{J^y (26) 

b) The estimate T, R is sparsistent and sign consistent with T,* R . 

c) If in addition, (J* M ) min ■= min (ij)e5M 1(^)^1 scales as (J| 7 ) 
estimate Jm is sparsistent and sign consistent with J M . 

Proof: See Appendix [Dl 

Remarks: 

1. Non-asymptotic sample complexity and error bounds: In the above theorem, we 
establish that the number of samples is required to scale as n = Q(d 2 logp). In fact, our 
results are non-asymptotic, and the exact constants are provided in inequality ()60p . The 
non-asymptotic form of error bounds are also provided in (|65p and (|8ip . 



= £l(y/logp/n) , then the 
□ 
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2. Extension to sub-Gaussian and other distributions: In the above theorem, we con- 
sidered Gaussian distribution. Similar to high dimensional covariance estimation in [5], the 
result in the theorem can be easily extended to sub-Gaussian and other distributions with 
known tail conditions. 

3. Comparison between direct estimation of S* and the above decomposition: The 

overall matrix S* (and J*) is a full matrix in general. Thus, if we want to estimate it directly, 
we need n = il(p 2 logp) samples since the maximum node degree is Q(p). Therefore, we can 
not estimate it directly in high dimensional regime and it demonstrates the importance of 
such sparse covariance + inverse covariance models for estimation. 

We discussed in the remark in section 13.21 that the parameter A allows us to carefully tune the 
contributions of the Markov and residual components. When A — > oo, the program corresponds to 
^i-penalized maximum likelihood estimator which is well-studied in OE]. In this case, £r = 
and all the dependencies among random variables are captured by the sparse graphical model 
represented by Jm- On the other extreme, when A* = and thus A = Q2 ydogp/n — > 0, with 
increasing the number of samples n, the off-diagonal entries in Jm are bounded too tight by A 
(refer to the primal program in (I39p ) and therefore the residual covariance matrix captures 
most of the dependencies among random variables. In this case, we have the covariance estimation 
S = Y>m — ^r, where the diagonal entries are included in Em and the off-diagonal entries are 
mostly included in — In order to explain the results for these cases in a more concrete way, we 
explicitly mention the results for both sparse inverse covariance estimation (A — > 00) and sparse 
covariance estimation (A ~ 0) methods in the following subsections. Note that both of these are 
special cases of the general result expressed in Theorem [2l Thus, in Theorem [21 we generalize 
these extreme cases to models with a linear combination of sparse covariance and sparse inverse 
covariance matrices. 

5 Discussions and Extension 

In this section, we first provide a detailed discussion of special cases sparse covariance and sparse 
inverse covariance estimation. Then, the extension of results to the structured noise model is 
mentioned. 

5.1 Sparse Inverse Covariance Estimation 

In this section, we mention the result for sparse inverse covariance estimation in high dimensional 
regime. This result is provided by [5] and is a special case of Theorem [2] when the parameter A goes 
to infinity. Before proposing the explicit result in Corollary [H we state how the required conditions 
in Theorem [2] reduces to the conditions in [5]. 

Since the support of residual matrix Y>* R is a zero matrix in this special case, the mutual incoherence 
conditions in (A. 4) reduce exactly to the same mutual incoherence condition in [5] as 




some a G (0, 1], 



(27) 
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where S = Sm is the support of Markov matrix J* = Jt, as defined in (|16p . Also note that the 
covariance control condition (A. 5) is not required any more. 

Furthermore, the sample complexity and convergence rate of estimation in Theorem [2] exactly 
reduce to the results in [5] as (for q = 8, 1 = 3) 

n > n } (^p T ; 1/max j^,2w(l + £) K SS K M max{ 1, ^-(l + ^jK ss K 2 M ^ , (28) 

|| J- J*!U <2K ss (l + ^)5 f (p T ;n), (29) 
where the result is valid for any q > 8 and I > 1. 

Corollary 1 (Sparse Inverse Covariance Estimation, [5j). Consider a Gaussian distribution with 
covariance matrix X* = J* -1 satisfying mutual incoherence condition ()27p . Given a sample covari- 
ance matrix S n using n i.i.d. samples from the Gaussian model, let J denote the optimal solution 
of the primal-dual pair (|39p and (|13p . with parameters 7 = C\ y^log p/n and X — > 00 (removing 
constraints in the primal program (|39j) ) for some constant C% > 0. Suppose that the sample size n 
is lower bounded as 

n = Q(d 2 logp), (30) 
then with probability greater than 1 — l/p c — > 1 (for some c > 0), we have: 
a) The estimate J >- satisfies bound 



\J-J\U-0[^). (31) 



b) If in addition (^*) min := m ^ n (i,j)^s M l(^*)j-l scales as (J*) min = ^(\/logp/n), the estimate 
J is sparsistent and sign consistent with J* . 

Remark [Comparison between the results of general case (Theorem [2]) and sparse inverse covari- 
ance estimation case (Corollary [TJ]: Considering the results in Theorem [21 sample complexity and 
convergence rate of estimated models are exactly the same as results in [5] with only some minor 
differences in coefficients. Compare (|60p with (|28p for sample complexity and (|65p with (|29p for 
convergence rate of estimated Markov matrix Jm- But regarding the mutual incoherence condi- 
tions, we observe that the conditions for the special case sparse inverse covariance estimation in 
(|27|) are less restrictive than the conditions for the general case in (|2U|) - (|21|) . Since the sparse in- 
verse covariance estimation [5] is a special case of the general model in this paper, this additional 
limitation on models is inevitable, i.e., it is natural that we need some more incoherence conditions 
in order to be able to recover both the Markov and residual models in the general case. 



5.2 Sparse Covariance Estimation 

High-dimensional estimation of sparse covariance models has been studied in [6]. They propose 
an estimation of a class of sparse covariance matrices by "hard thresholding". They also prove 
spectral norm guarantees on the error between the estimated and exact covariance matrices. We 
also recover similar results in the other extreme case of proposed program (|13p when A ~ 0. The 
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program reduces to the sparse covariance estimator as discussed earlier. In order to see that again, 
let us investigate the dual program restated as follows 



(Sm,X#) :=argmax logdetXW - A||£fl||i )0 ff 
s. t. || S n - Em + Sijlloo^fr < 7, 

&R) d = 0, 

S M y 0, S M - E R y 0. 



When the parameter A ~ 0, the variable S/j is very slightly penalized in the objective function. 
Therefore, most of the statistical dependencies are captured by and thus, off-diagonal entries 
of take very small values. Furthermore, according to the property of optimization program 
that the support of T,r is contained within the support of Jm , sparsity on is encouraged by the 
effect of parameter 7. 

It is also observed that we are approximately performing "soft thresholding" in program (|13p (when 
A ~ 0) comparing to "hard thresholding" in [6j. Consider the case A = 0, where the Markov part 
T,m is a diagonal matrix. Therefore, the ||S n — T<m + X.R||oo,off < 7 constraint in the dual program 
(fT3l) reduces to ||£ n -|-5].R||oo,off < 7 where it is seen that the negative soft thresholding is performed 
on matrix S n with threshold parameter 7, given by 

Sj(x) = sign(-x)(|x| - 7)+. (32) 

Notice that we need to have A ~ for recovering the sparse covariance matrix given empirical 
covariances and in this case, we can view the estimator as approximately performing soft thresh- 
olding. 

Finally, we propose the corollary for this special case. Before that, we need some additional 
definitions for a general covariance matrix S*. Similar to definition (|17p . the support of a covariance 
matrix S* is defined as 

S x :={(i,j)eVxV\E*j^0}. (33) 
The maximum node degree for a covariance matrix S* is also defined as 

d E := max |{i : e 5 E }|. (34) 

3=1,-,P 

Corollary 2 (Sparse Covariance Estimation). Consider a Gaussian distribution with covariance 
matrix S* satisfying eigenvalue control condition (A. 6). Given a sample covariance matrix S n 
using n i.i.d. samples from the Gaussian model, let CEm,^r) denote the optimal solutions of 
the primal-dual pair (139p and (j!3j) . with parameters 7 = C\y/logp/n and A = C2 y/log p/n for 
some constants Ci,C% > 0. The estimated covariance matrix £ is defined as S Q fj := — S_r and 
S rf := {E M ) d - Suppose that (^ ff ) min := mm^s^^ |(S*) y | scales as (X* ff ) min = n{y/log p/n) 
and the sample size n is lower bounded as 

n = n(4logp), (35) 

then with probability greater than 1 — l/p c — > 1 (for some c > 0), we have: 
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a) The estimate E satisfies bound 



J off • 



6j T/ie estimate £ g- zs sparsistent and sign consistent with S* 
Proof: See Appendix [FJ □ 



5.3 Structured Noise Model 



In the discussion up to now, we considered general residual matrices Sp, not necessarily positive 
definite, thereby allowing for a rich class of covariance decomposition models. In this section, we 
modify the conditions and the learning method to incorporate positive-definite residual matrices 

We regularize the diagonal entries in an appropriate way to ensure that both and S|j are 
positive definite. Thus, the identifiability assumptions (A.0)-(A.3) are modified as follows: 

(A.O') £*, SJj and J* M are positive definite matrices, i.e., S* >~ 0, T,* R y 0, >~ 0. 

(A.l') Jfij is normalized such that (J^V) . = AJ for some AJ > and off-diagonal entries of are 
bounded from above, i.e., ||J^||oo,off < A|, for some A| > 0. 

(A. 2') The off-diagonal entries of £Jj satisfy 

(££W0 l(Jw) w l=A5, V»Vj. (37) 



(A. 3') For any we have sign((SJj) .) . sign((j^) .) > 0, i.e, the signs are the same. 

It is seen in (A.l') that we put additional restrictions on diagonal entries of the Markov matrix 
in order to have nonzero diagonal entries for the residual matrix Sp. 

Similar to the general form of dual program introduced in (|44p , we propose the following optimiza- 
tion program to estimate the Markov and residual components in the structured noise model: 

(Em,Er) := argmaxlogdetEA/ - Ai||£R||i )0n - A 2 ||£fl||i.off 

s.t. ll^ + Etf-EMlUoff <7, (38) 
(Z n ) d +£ R ) d =(Z M ) d . 

The decomposition result under exact statistics can be similarly proven by setting parameter 7 = 
when the identifiability assumptions (A.0')-(A.3') are satisfied. Furthermore, under additional 
estimation assumptions (A.4)-(A.6), the sample statistics guarantees in Theorem [2] can be also 
extended to the solutions of above program. 
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6 Proof Outline 



In this section, the Lagrangian primal form for the proposed dual program (|13p is provided first 
and then the proof outlne is presented. For now, we drop the positive-definiteness constraint 
Em — TIr >- in the proposed dual program (fT3|) . We finally show that this constraint is satisfied 
for the proposed estimation under specified conditions and thus this constraint can be dropped. 
In the subsequent discussion, we drop this constraint. It is shown in Appendix [X] that the primal 
form for this reduced dual program is 

J M ■■= argmin (S n , J M ) -logdet J M + 7 ||^A/||i,off (39) 

S- 1. || Jm || oo,off < \ 

We further establish that Hm = J M ^ is valid between the dual variable £m and primal variable Jm 
and thus, 

||E"-J^ + E i? || 0OiOff < 7 . (40) 

Comparing the above with the exact decomposition S* = — £|j in ([7]), we note that for the 

sample version, we do not exactly fit the Markov and the residual models with the sample covariance 
matrix T, n , but allow for some divergence, depending on 7. Similarly, the primal program (|39ft has 
an additional l\ penalty term on Jm, which is absent in (|12|) . Having a non-zero 7 in the primal 
program enables us to impose a sparsity constraint on Jm, which in turn, enables us to estimate the 
matrices in the high dimensional regime (p~^> n), under a set of conditions of sufficient conditions 
given in section 14.21 

We now provide a high-level description of the proof for Theorem [2 The detailed proof is given 
in Appendix [D] The proof is based on the primal-dual witness method, which has been previously 
employed in [5] and other works. However, we require significant modifications of this approach in 
order to handle the more complex setting of covariance decomposition. 

In the primal-dual witness method, we define a modified version of the original optimization pro- 
gram (139 j) . Note that the key idea in constructing the modified version is to be able to analyze 
it and prove guarantees for it in a less complicated way comparing to the original version. Let us 
denote the solutions of the modified program by (Jm, Sr) pair. In general, the optimal solutions 
of the two programs, original and modified one, are different. However, under conditions (A.0)- 
(A.5), we establish that their optimal solutions coincide. See Appendix [D] for details. Through 
this equivalence, we thus establish that the optimal solution (Jm,^>r) of the original program in 
(|39p inherits all the properties of the optimal solution (Jm, ^r) of the modified program, i.e., the 
solutions of the modified program act as witness for the original program. In the following, we de- 
fine the modified optimization program and its properties. The primal-dual witness method steps 
which guarantee the equivalence between solutions of the original and the modified program are 
mentioned in Appendix [Dl 

We modify the sample version of our optimization program in ([39]) as follows: 

J M ■= argmin J M ) - logdet J M + 7 || JmIIi.oA (41) 

s.t. {Jm) S c m = 0, {Jm) Sr = Asign((JX / ) 5fl ). 
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Figure 3: The sets Sr, S and S M form a partition of {(1, ...,p) x (1, ...,p)}, where p is the number of nodes, 
Sr is the support of the residual covariance matrix T,* R and Sm is the support of the precision matrix of 
the Markov model and S M is its complement. 



Note that since we do not a priori know the supports of the original matrices J M and EJj, the above 
program cannot be implemented in practice, but is only a device useful for proving consistency 
results. We observe that the objective function in the modified program above is the same as the 
original program in (|39p . and only the constraints on the precision matrix are different in the two 
programs. In the above program in (|4ip . constraints on the entries of the precision matrix when 
limited to sets Sr and S C M are more restrictive, while those in set S := Sm \ Sr are more relaxed 
(i.e., the £oo constraints present in (139p are removed above), compared to the original program 
in ([39]) . Recall that Sm denotes the support of the Markov model, while Sr C Sm denotes the 
support of the residual or the independence model. See Figj3j 



We now discuss the properties of the optimal solution ( Jm^r) of the modified program in (JHJ). 
Since the precision matrix entries on S C M are set to zero in (j4T|) . we have that Supp(Jjw) Q 
Supp(J| / ). Denoting Y>r as the residual covariance matrix corresponding to the modified program 
(|4ip. we can similarly characterize it in the following form derived from duality: 

(f- \ = J for G S (AO) 

{ R)i i 1 for (i,j)eS R ,S c M , { > 

where are the Lagrangian multipliers corresponding to the equality constraints in the modified 
program (I4ip . 



Define estimation errors Aj := Jm~ J M an d Ar := T,r — T, r for the modified program in (j4"T|) . It is 
easy to see that (A j) Sr = A<j, (Aj) sc = 0, (Ar) s = 0, where A,5 := A — A* > 0. This implies that 

in any of the three sets S, Sr or S M , only one of the two estimation errors Aj or A# can be non-zero 
(or is at most Xs). This property is crucial to be able to decouple the perturbations in the Markov 
and the independence domains, and thereby gives bounds on the individual perturbations. It is not 
clear if there is an alternative partitioning of the variables (here the partition is S, Sr and S M ) 
which allows us to decouple the estimation errors for Jm and Y*r. Through this decoupling, we are 
able to provide bounds on estimation errors Aj and Ar and thus, Theorem [2] is established. 



7 Experiments 

In this section, we provide experimental results for the proposed algorithm. We term our proposed 
optimization program as l\ + method and compare it with the well-known l\ method which is a 
special case of the proposed algorithm when A = oo. The optimization programs are implemented 
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by YALMIP [43J and SDPT3 [2] packages for MATLAB. We also compare the performance of 
applying belief propagation to exact models. 

Synthetic Data: We build a Markov + residual synthetic model in the following way. The 
underlying graph for the Markov part is an 8 x 8 2-D grid structure (4- nearest neighbor grid). We 
choose the off-diagonal nonzero entries in J M (corresponding to the grid edges) randomly from set 
{—0.2,0.2}, i.e., [Jj^Ji- ^ {—0.2,0.2} for all i,j G Em- Then we ensure that J* M is positive definite 
by adding some uniform diagonal weighting. We choose 0.5 fraction of Markov edges randomly to 
introduce residual edges. The value of these nonzero entries in T,* R are chosen from {—0.2, 0.2} such 
that the sign of residual entry is the same as the sign of overlapping Markov entry (assumption 
(A. 3)). We also generate a random mean in the interval [0, 1] for each variable. 

Before we go through experiment results, it is worth mentioning that the realization of above 
model is an example that both Markov and residual matrices J M and H R are sparse while the 
overall covariance matrix X* = J M ~ l — E|j and concentration matrix J* = X* -1 are both dense 
matrices. 

We apply t\ + and t\ methods to a random realization of the above described model X* = 
Jl^ 1 — XJj. The edit distance between estimated and exact Markov model Jm and J M is plotted 
in figure HJa for different number of samples. First observation is that by increasing the number of 
samples, the edit distance decreases which is consistent with theoretical results. We also see that 
the behaviour of l\ + method is very close to l\ method which suggests that sparsity pattern 
of J M can be estimated efficiently under either methods. The edit distance between T,r and T,* R 
is plotted in figure Hlb. We again see decreasing trend for t\ + method here with increasing 
the number of samples. But since there is not any off-diagonal constraints in l\ method, it can 
not recover the residual matrix X|j. Finally the ^oo-elementwise norm of error between estimated 
precision matrix J and the exact precision matrix J* is sketched for both methods in figure 01c. 
We observe the advantage of proposed i\ + method in estimating the overall model precision 
matrix J* = X* -1 . 

Next, the results of running Loopy Belief Propagation (LBP) on the models are presented. We 
compare the result of applying LBP to the same J* and J M models generated in the learning 
discussion above. The log of average mean and variance errors over all nodes are sketched in figure 
[5] throughout the iterations. We observe that LBP does not converge for J* model. It is shown in [8] 
that if a model is walk-summable, then the mean estimates under LBP converge and are correct. 
The spectral norms of the partial correlation matrices are |||.Rm||| = 0.8613 and |||i?||| = 3.2446 
for J M and J* models respectively. Thus, the matrix J* is not walk-summable and therefore its 
convergence under LBP is not guaranteed and this is seen in figure [5j On the other hand, LBP 
is accurate for J M matrix. Thus, our method learns models which are better suited for inference 
under loopy belief propagation. 

Foreign Exchange Rate Data: In this section, we apply the proposed algorithm to the foreign 
exchange rate data se10. The dataset includes monthly exchange rates of 19 countries currency 
with respect to US dollars from October 1983 to January 2012. Thus, the dataset has 340 samples 
of 19 variables. We apply the optimization program (I13h with a slight modification. Since the 
underlying model for this data set does not necessarily satisfy the proposed eigenvalue condition 

7 Dataset available at http://research.stlouisfed.org/fred2/categories/15/downloaddata 
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Figure 4: (a) Edit distance between estimated Markov model Jm and exact Markov model J M . (b) Edit 
distance between estimated residual model and exact residual model Yi* R . (c) precision matrix estimation 
error || J* — JHoo , where J = Jm for i\ method and J = { J M X — £r) 1 for i\ + method. 

(A. 6), we need to make sure that the overall covariance matrix estimation E is positive definite 
and thus a valid covariance matrix. We add an additional constraint to the optimization program 
(|13p . imposing a lower bound on the minimum eigenvalue of overall covariance matrix A m j n (S), 
i.e., \ m i n {Ti) > o m i n . The parameter a m i n is set 0.001 in this experiment. 

The resulting edges of Markov and residual matrices for some moderate choice of regularization 
parameters 7 = 20 and A = 0.004 are plotted in figure There is sparsity on both Markov and 
residual structures. There are two main observations in the learned model in figure El First, it is 
seen that the statistical dependencies of foreign exchange rates are correlated with the geographical 
locations of countries, e.g., it is observed in the learned model that the exchange rates of Asian coun- 
tries are more correlated. We can refer to Asian countries "South Korea", "Japan" , "China" , "Sri 
Lanka" , "Taiwan" , "Thailand" and "India" in the Markov model where several edges exist between 
them while other nodes in the graph have much lower degrees. We observe similar patterns in the 
residual matrix, e.g., there is an edge between "India" and "Sri Lanka" in the residual model. We 
also see the interesting phenomena in the Markov graph that there exist some high degree nodes 
such as "South Korea" and "Japan" . The presence of high degree nodes suggests that incorporating 
hidden variables can further lead to sparser representations, and this has been observed before in 
other works, e.g., [7j, [33] and [4"5] . 

Monthly Stock Returns Data: In this section we apply the algorithm to monthly stock returns 
of a number of companies in the S&P 100 stock index. We pick 17 companies in divisions "E. Trans, 
Comm, Elec&Gas" and "G. Retail Trade" and apply the optimization program (|39p to their stock 
returns Data to learn the model. The resulting edges for Markov and residual matrices are plotted 
in figure [3 There is sparsity on both Markov and residual structure. The isolated nodes in the 
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iteration iteration 




Figure 6: Markov and independence graph structures for the foreign exchange rate data set with regular- 
ization parameters 7 — 20 and A = 0.004. Solid edges indicate Markov model and dotted edges indicate 
independence model. 

Markov graph are not presented in the figure. We see in both Markov and residual graphs that there 
exist more correlations among the nodes in the same division or industry. There are 5 connected 
partitions in the residual graph, e.g. nodes "HD", "WMT" , "TGT" and "MCD", all belonging to 
division Retail Trade, form a partition. This is also observed for the telecommunication industries 
(companies "T" and "VZ" ) and energy industries (companies "ETR" and "EXC" ) . We see a similar 
pattern in the Markov graph but with more edges. Similar to exchange rate data set results, we 
also observe high degree nodes in the Markov graph such as "HD" and "TGT" which suggest 
incorporating hidden nodes. 

8 Conclusion 

In this paper, we provided an in-depth study of convex optimization methods and guarantees 
for high-dimensional covariance matrix decomposition. Our methods unify the existing results 
for sparse covariance/precision estimation and introduce a richer class of models with sparsity 
in multiple domains. We provide consistency guarantees for estimation in both the Markov and 
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Figure 7: Markov and independence graph structures for the monthly stock returns data. Solid edges 
indicate Markov model and dotted edges indicate independence model. 

the residual domains, and establish efficient sample complexity results for our method. These 
findings open up many future directions to explore. One important aspect is to relax the sparsity 
constraints imposed in the two domains, and to develop new methods to enable decomposition 
of such models. Other considerations include extension to discrete models and other models for 
the residual covariance matrix (e.g. low rank matrices). Such findings will push the envelope of 
efficient models for high-dimensional estimation. It is worth mentioning while in many scenarios it 
is important to incorporate latent variables, in our framework it is challenging to incorporate both 
latent variables as well as marginal independencies, and provide learning guarantees, and we defer 
it to future work. 
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Appendix 

A Duality Between Programs 

In this section we prove duality between programs f)39|) and (|13p (when the positive-definiteness 
constraint Y*m — £r >- is dropped). By doing this, the duality between programs (fT2|) and (fTU|) is 
also proved since they are special cases of (f39l) and (fl~3l) when 7 is set to zero and S n is substituted 
with £*. 

Before we prove duality, we introduce the concept of subdifferential or subgradient for a convex func- 
tion not necessarily differentiable. Subgradient (subdifferential) generalizes the gradient (derivative) 
concept to nondifferentiable functions. Supposing convex function / : M. n — >• R, the subgradient at 
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a point xq which is usually denoted by df(xo) consists of all vectors c such that 

f(x)>f(x ) + (c,x-x ), VxGDom/. 



(43) 



In order to prove duality, we start from program (|13p (when the positive-definiteness constraint 
Em — y is dropped) and derive the primal form (f39|) . Program (fl~3l) can be written in the 
following equivalent form where Ai goes to infinity and A2 is used instead of A. 

(£m,£r) := argmaxlogdetEM - Ai||£n||i )0 n - A 2 ||£fl||i l0 ff 

s.t. \\^ n -^M + ^ R \\oo,oS <7, (44) 

( S M) d "(^) d =( £n ) d - 

By introducing the dual variable Jm for above program, we have: 

min -{Jm,^r) = -Ai||£r||i,oii - A 2 ||S^|| li0ff , (45) 

llJvf II 00,011 <Ai 
l|^M||oo,off<^2 

where (Jm)oii £ Ai<9||£^||i i0n , (Jm)oS € A20||£fl||i lO ff minimizes the above program. Thus, we have 
the following equivalent form for program (|44p : 

min max logdet — (Jj\,/, S/j), (46) 

IIJull 00 , off 

<A 2 ||E"-S M +S fl |U. off <7 
(E AJ ) d -(E fl ) d =(E") d 

where the order of programs is exchanged. If we define the new variable X = Em — ^R, an d use E 
as the new variable in the program instead of E#, the inner max program becomes 

max logdet E M - (Jm, Em) + (Jm, E). (47) 

l|S"-S|| 00 , off < 7 ,S d =(E") d 

Since the objective function and constraints are disjoint functions of variables E and Em, we can 
do optimization individually for two variables. The optimizers are Em = J™ anc ^ ^ = ^™ + 7^7 > 
where Z 7 is a member of the subgradient of || • ||i )0 ff evaluated at point Jm, i-e., 



(Zy)ij 



for i = j 

€[-1,1] for i£j,[ J M)ij= Q (48) 

sign((J M )^) for i / j, (Jm)^- / 0. 



Also note that since Em should be positive definite, the variable Jm should be also positive definite. 
Therefore, it adds another constraint Jm >- 0. If we substitute these optimizers, we get the dual 
program 

min (E n , J M ) -logdet J M + 7ll<4f ||l,off, (49) 

Jm>-0 

\\Jm\\ X.Oli 

<Ai 

II^Af lloo,off<A2 



which is equivalent to (|39p when Ai goes to infinity and therefore the result is proved. 
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B Characterization of the Proposed Optimization Programs 



We proposed programs (|12|) and (|39|) to do decomposition and estimation respectively. Former is 
used to decompose exact statistics to its Markov and residual covariance components and the latter 
is used to estimate decomposition components given sample covariance matrix. In this appendix 
we characterize optimal solutions of these optimization programs. Both programs are convex and 
therefore the optimal solutions can be characterized using standard convex optimization theory. 
Note that the proof of following lemmas is mentioned after the remarks. 

Lemma 1. For any A > ; primal problem (fT2j) has a unique solution Jm >- which is characterized 
by the following equation: 

S*-Jh 1 + Z = 0, (50) 



where Z has the following form 



Z, 



for i = j 

^ for i^j,\[JM) ij \<X 

aij sign ((J M ) {j ) for % ^ j, {(Jm)^ = A, 



(51) 



in which a~ij can only take nonnegative values, i.e., we have a~ij > 0. 

Remark: Comparing Lagrangian optimality condition in (150|) with relation S* = J m 1 — Hr between 
solutions of primal-dual optimization programs (derived in Appendix |A|) implies the equality = 
Z. Thus, X# entries are determined by Lagrangian multipliers of primal program. More specifically, 
we have 

for i = j 

for i^ j,\{J M ) lJ \<\ (52) 



a 



y 1 



sign((j M ) ij -) for i^j,\(Ju) i 
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A. 



where Sj,- > are the Lagrangian multipliers of primal program (|12|) . 



Lemma 2. For any A > 0, 7 > and sample covariance matrix S n with strictly positive diagonal 
entries, primal problem (|39p has a unique solution Jm >~ which is characterized by the equation 



0; 



(53) 



where Z = Z a +jZ^. Matrix Z 7 G d\\ Jm||i oft an d Z a is represented as in f)51 1) for some Lagrangian 
multipliers > 0. 

Remark: Comparing Lagrangian optimality condition in (|53p with relation S n = J^ 1 — — 
7^ 7 between solutions of primal-dual optimization programs (derived in Appendix [A]) implies the 
equality £_r = Z a . Thus, Y,r entries are determined by the Lagrangian multipliers of primal 
program. More specifically, we have 



R)ij 



for i = j 

for i^j,\(JM)ij\ < A 

sign(( Jm) for i ^ j, | (Jm)^ I = A, 



(54) 
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where > are the Lagrangian multipliers of primal program f|39|) . 



Proof: We prove Lemma [2] here and Lemma [T] is a special case of that when 7 is set to zero and 
E ra is substituted with E*. 

For any A > and 7 > 0, the optimization problem (|39p is a convex programming where the objec- 
tive function is strictly convex. Therefore, if the minimum is achieved it is unique. Since off-diagonal 
entries of Jm are bounded according to constraints, the only issue for minimum achievement may 
arises for unbounded diagonal entries. It is shown in [5] that if diagonal entries of E n are strictly 
positive, the function is coercive with respect to diagonal entries and therefore here is no issue 
regarding unbounded diagonal entries. Thus, the minimum is attained in Jm t 0. But since when 
Jm approaches the boundary of positive definite cone, the objective function goes to infinity, the 
solution is attained in the interior of the cone Jm >~ 0. After showing that the unique minimum is 
achieved, let us characterize the minimum. 



Considering as Lagrangian multipliers of inequality constraints of program (|39p , the Lagrangian 
function is 



C(J M ,a) = (E n , J M ) - logdet J M + i\\Jm ||i,off + /^2 a ij [|(^m)^| - A]. 



(55) 



We skipped positive definiteness constraint in writing Lagrangian function since it is inactive. Based 
on standard convex optimization theory, the matrix Jm >~ is the optimal solution if and only 
if it satisfies KKT conditions. It should minimize the Lagrangian which happens if and only if 
belongs to the sub differential of Lagrangian or equivalently there exists a matrix Z such that 



E" - J M l + Z = 0, 
where Z = Z a + 7^7 ■ Matrix Z 7 € d\\ JA/||i,off and Z a is 



(56) 





£ a 



for i = j 
-1,1] for i^j, (J M ). 



(57) 



a 



sign((j M )ij)) for i ^ j, (JmL- + 0, 



for some Lagrangian multipliers 2jj > 0. The solution should also satisfy complementary slackness 
conditions 2^-. [| (Ju) i \ — A] = for i ^ j. Applying this condition to above Z a representation, 



results to (|51j) form proposed in the lemma. 



□ 



C Proof of Theorem [j] 

First note that as mentioned in the remark in section [321 the- pair (Jm> E#) given by optimization 
program gives a decomposition E* = J^ 1 — E^j which is desired. 

Next, in order to prove the equivalence, we show that there is a one to one correspondence between 
the specified conditions (A.0)-(A.3) for valid decomposition and the characterization of optimal 
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solution of optimization program given in lemma [TJ We go through each of these conditions one by 
one in the following lines. 

Condition (A.O) is considered in optimization program as positive definiteness of Markov matrix 
Jm- 

Condition (A.l) is exactly the primal constraint \\Jtf ||oo,off 

< A. 

Condition (A. 2) is exactly the same as relation (I52p where diagonal entries of residual covariance 
matrix are zero and its off-diagonal entries can be nonzero only if the absolute value of corresponding 
entry in Markov matrix takes the maximum value A. 
Condition (A. 3) is exactly the same as inequality ctij > 0. 

In the above lines, we covered one by one correspondence for conditions (A.0)-(A.3). But note 
that we also covered all the equalities and inequalities that characterize unique optimal solution of 
optimization program. In other words by above correspondence we proved that both of the following 
derivations are true where second one is the reverse of first one. On one hand, any optimal solution 
of optimization program gives a valid decomposition under desired conditions. On the other hand, 
any valid decomposition under desired conditions is a solution of proposed optimization program. 
Thus, we can infer that these two are exactly equivalent and the result is proved. Since the solution 
of optimization program is unique and according to the equivalence between this solution and 
decomposition under those conditions, uniqueness is also established. □ 

D Proof of Theorem [2] 

In this appendix, we first mention an outline of the primal-dual witness method and then provide 
the detailed proof of the theorem. 

D.l Primal-Dual Witness Method 

First, continuing the proof outline presented in section [U we provide an outline of the primal-dual 
witness method steps in order to establish equivalence between optimal solutions of the original 
(|39p and the modified (I4ip optimization programs. 

1. The primal witness matrix Jm is defined as in f|41 1) . 

2. The dual witness matrix is set as Z = — S n + J7} ■ It is defined in this way to satisfy original 
program optimal solution characterization mentioned in appendix [Bl 

3. We need to check the following feasibility conditions under which the modified program 
solution is equivalent to the solution of original one: 

(a) || Jm ||oo,off,S — Since we relaxed the bounds on off-diagonal entries in set S, 
we need to make sure that the modified solution satisfies this bound in order to have 
equivalence between modified and original programs solutions. 

(b) Set {Z a ) SR = {-T, n +J M 1 -l{Z 1 )) SR where Z 7 6 d\\J M \\ 1;OS . Note that since | (Jm)^! = 
A 7^ for any £ Sr, then and therefore above equation is well-defined. Now we 
need to check: (Z a ^).. (Jm\ - > for all € Sr. This means that they have the same 
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sign or one of them is zero. We need this condition for equivalence between solutions 
because Lagrangian multipliers in original program (|39p corresponding to inequality 
constraints should be nonnegative. 

(c) H^Hoo s c < 7: According to the \Jm) c c = constraint in the modified program, all 

the inequality constraints become inactive in the original one when desired Jm = Jm 
equality is satisfied. Then, complementary slackness condition enforce all the Lagrangian 
multipliers corresponding to set S C M to be zero. These can be satisfied by the above strict 
dual feasibility. Also note that having zero Lagrangian multipliers results in zero residual 
entries, i.e., (Er) qc = and therefore II A^lloo s c = when this feasibility condition is 

V M ' M 

satisfied. 

Also note that we dropped the positive-definiteness constraint £m — Y,r y in the proof outline. 
Thus, in addition to above conditions, we also need to show that S = Sjv/ — S^j y in the modified 
program. 

Before we state the detailed proof for the theorem, we introduce a pair of definitions which are 
used in the analysis. Let us define matrix E as difference between sample covariance matrix and 
the exact covariance matrix 

£:=£"-£*. (58) 
We also define i?(Aj) as the difference between J M X and its first order Taylor expansion around 

J* M . Recall that Aj was defined as Aj := Jm — J*m- According to results for first order derivative 
of inverse function J M l [H] , the remainder is 

Rfij) = j m - Jif 1 + Jir l ^jJir l - (59) 



D.2 Proof of the Theorem 



Exploiting lemmata mentioned in Appendix |El the Theorem [2] is proved as follows: 
Proof: According to the sample error bound mentioned in Lemma HI we have ||-E , || o ^ &f{p r ] n ) 
for some r > 2 with probability greater than or equal to 1 — l/p T ~ 2 . In the discussion after this, it 
is assumed that the above bound for 1 1 _Zi7 1 1 is satisfied and therefore the following results are valid 
with probability greater than or equal to 1 — l/p T ~ 2 . 



By choosing 7 = ^5f(p T ; n), we have ||-E||oo < ^7 as desired for Lemma [SJ Choosing \$ as in ([73]) 
(compatible with what mentioned in the theorem), we only need to show that the other bound on 
ll-RII 

oo is also satisfied to be able to apply Lemma [5l As stated in the remark after Theorem [2j the 
bound on sample complexity is not asymptotic and we assume the following lower bound on the 
number of samples which is compatible with the asymptotic form mentioned in the theorem: 

n > n f (p T ; l/maxju*, Ald(l + ^K SS K M max{l, ^-(l + ^K SS K 2 M ^ , (60) 

for some I > 1. Because of monotonic behaviour of the tail function, for any n satisfying above 
bound, we have: 

MpT]n) ~ mm {h m(i + *)k ss k m > im { i TW^m\ (61) 
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According to the selection for regularization parameters Xs and 7 and the bound on sample error 
Halloo, we have: 



r := 2K SSr X S + 2K SS {\\E\\ 00 + 7) < 



6 f (p T ;n) (62) 



1 — 2Kgs R ^ m ' a \ a 

= 2K S s(l + -)Sf(p T ;n) 1 (= A,) 

V a J 1 - 2K S s R 

1 + — )5 f {p T ;n), 

where in the last inequality, we used the second condition is assumption (A. 4) that K$s R < 1/4. 
Note that second line is equal to Xs since we assigned the same value in (j73|) . Applying the bound 
(|6ip on above inequality, we have 

+ + 7) < ndo {jJL-, 4M(1+ 'g^ sAl } («) 



< min 



1 Z-l 



ZdFT M ' 2ldK SS K 3 M 
Thus, the conditions for Lemma [7] are satisfied and we have 

\\^j\\oo,S<2Kss R X S + 2K ss {\\E\\ 00 + 1 ) <Xs<AK SS (l + -)5 f (p T ;n). (64) 

Above inequalities tell us multiple things. First, since the error HAjHoo^ is bounded by Xs, the 
Jm entries in set S can not deviate from exact one Jj, more than Xs- We also assumed that the 
off-diagonal entries in J* M are bounded by A*. Therefore according to the definition of Xs := A — A*, 
the entries in {Jm) oS s ar e bounded by A and therefore the condition (a) for feasibility of primal- 
dual witness method is satisfied, i.e., we have || JM||oo,off,,s < A. Second, since ||Aj||oo tl g^j — Xs, we 
have ||Aj||oo,5 < HAjHoo^ and therefore ||Aj||oo = || ^-j\\oo,S R = ^5 which results the following 
error bound 

IIAjIU := \\J M - JmIU < 4^55(l + -)S f (p T ;n). (65) 

Furthermore, HAjHoo < bound can be concluded from above inequality by substituting 

5f(p T ;n) from (|6ip . Thus, the condition for Lemma[6]is satisfied and we have the following bound 
on the remainder term 

\\R{Aj)\\^ < j^dllAjlllKl; (66) 



1-1 OJ V a 



5 f (p T ;n) 



<*/(p T ;n) = - 7 , (67) 
m 

where in the second inequality, we used error bound in (|65p and the last inequality is concluded 
from bound (1611). 
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Now the conditions for Lemma[5]are satisfied and therefore we have the upper bound on || Aj^^ 5 B < 
C37 and the strict dual feasibility on S C M . Second result satisfies condition (c) of the primal-dual 
witness method feasibility conditions. The upper bound on || A^Hoo in conjunction with the lower 
bound on . > C37 (mentioned in the theorem), ensures that the sign of EJj and E/j are 

the same which results that the condition (b) of the feasibility conditions for primal-dual witness 
method is satisfied. Since all three conditions (a)-(c) are satisfied, we have equivalence between the 
modified program and the original one under conditions specified in the theorem. It gives us both 
results (a) and (b) in the theorem. Then by assuming lower bound on minimum nonzero value of 
J^, the result in part (c) is also proved. 

As mentioned before, we need to show that the dropped constraint E = Ejy/ — E# ^ is also 
satisfied. Since the conditions for Corollary [3] in Appendix IE. 51 are satisfied, we have the spectral 
norm error bound (I90p on overall covariance matrix E. Applying the inverse tail function for 
Gaussian distribution in (|70p to assumption (A. 6) results that the minimum eigenvalue of exact 
covariance matrix E* satisfies lower bound A m i n (E*) > (C4 + ^C^jdS f(p 7 ;n) + C^d 2 [5f(p T ; n)] 
where Cq := (C4 + rr ^Cz)\f2q 2 and C7 := 2q 2 C^. Then by exploiting Weyl's theorem (Theorem 
4.3.1 in |42j). the estimated covariance matrix E is positive definite and thus valid. Therefore, the 
result is proved. □ 



E Auxiliary Lemmas 

First, the tail condition for a probability distribution is defined as follows. 

Definition 3 (Tail Condition). The random vector X satisfies tail condition with parameters f 
and v* if there exists a constant v* 6 (0, 00) and function / : N x (0, 00) — > (0, 00) such that for 
any £V xV: 

P[|E n - E*-| > 5] < — i— for all 5 € (0, -]. (68) 
j [n, 0) 

Note that since the function f(n,5) is an increasing function of both variables n and 5, we define 
the inverse functions rif(r;6) and 5f(r;n) with respect to variables n and 5 respectively (when the 
other argument is fixed), where f(n,S) = r. 



E.l Concentration Bounds 

From Lemma 1 in [5], we have the following concentration bound for the empirical covariance 
matrix of Gaussian random variables. 

Lemma 3 ( [5]). Consider a set of Gaussian random variables with covariance matrix E*. Given 
n i.i.d. samples, the sample covariance matrix S n satisfies 

P[|E$ - E?-| > 5} < 4ex P |-^f ] for a11 S G (M> (69) 
for some constant q > 0. 



31 



Thus the tail function for Gaussian random vector takes the exponential form with the following 
corresponding inverse functions: 

_ 2g 2 log(4r) - / 2 g 2lo g (4r) 

n /( r ; 6 ) = — p — ' M r ; n ) = V — - — ( 7 °) 

Applying above Lemma, we get the following bound for sampling error. 

Lemma 4 ( [5J). For any r > 2 and sample size n such that 5f(p T ;n) < l/v m , we have 

P[||£||oc >6f(p T ;n)] < ^2^0. (71) 



E.2 Feasibility Conditions 

In the following lemma, we propose some conditions to bound the residual error HA^H^ s R and 
also satisfy the condition (c) of feasibility conditions required for equivalence between the witness 
solution and the original one. 
Lemma 5. Suppose that 



max-dl-RHoo, ||-E||oo 
2K SS 



1 - 2K SSh 



a 

} < -7, 
m 

a \ 

1 + -7, 
ra) 



(72) 
(73) 



then 



a) \\A.r\\oo,s r < C37 for some C 3 > 0. 

b) ||^||oo,5^ < 7- 

Proof: Applying definitions (|58p and (j59[) to optimality condition considered in second step of 
primal-dual witness method construction, gives the following equivalent equation 



Jt^AjJlr 1 - - R{Aj) +E + Z = 0. 



(74) 



Above equation is a p x p matrix equation. We can rewrite it as a linear equation with size p 2 if 
we use the vectorized form of matrices. Vectorized form of a matrix D G W xp is a column vector 
D € MP which is composed by concatenating the rows of matrix D in a single column vector. In 
the vectorized form, we have 



vec(j£r 1 AjJ* M X ) 



[Jir 1 ®Jir l )Aj = T*Aj. 



(75) 



Decomposing the vectorized form of (|7S|) into three disjoint partitions S, Sr and S C M gives the 
following decomposed form 



ss 



ss E 



SrS SrSr 
"P* "P* 



SrSI 
















Xs 






+ 















-i? + £ + Z) 
-i? + S + Z) 
-R + E + Z) 



Sr 



0, (76) 
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where we used the equalities (Aj) 5 = As and (Aj) gc = 0. Note that vector As only includes 
±A<5 entries according to the constraints in the modified program. Also note that EJj is zero in 
sets S and S^- We also dropped the argument Aj from remainder function i?(Aj) to simplify the 
notation. 

Similar to the original program, the matrix Z is composed of two parts, Zp and 2T~, i.e., Z = Zp + 
7^ 7 . Matrix Zp = from equation (|%2"]l . includes Lagrangian multipliers and Z 7 G 0||</m ||i,off- 
For set S, {Zp) s = 0, since we don't have any constraint in the program and therefore the La- 
grangian multipliers are zero. Applying this to the first row of equation (|76|) and since T$ s is 

invertible, we have the following for error Aj in set S 



(77) 



In set Sr, Zs r = (Sr) 5 ,^+7(Z 7 ) s ,^. Applying this to the second row of equation ([75]) results 

T *s R sfij) s + n R sjZ> + fin) SR + 7(^7)5, " Rs R + E Sr = 0, (78) 

Recall that we defined Ar := Hr — T,* R . Substituting ([77jl in above equation results the following 
for error Ar in set Sr 



fro 



-T* SSB \ s + R s -E s —y{Z y ). 



(79) 



r 5 fl 5«^ - 7 (Z 7 ) 5fl + # 5fl - E Sr . 



Taking element-wise norm from above equation and using inequality ||Ac||oo < I^MIoolMloo for 
any matrix A € M rxs and vector x € M s , results the bound 



lAfllloo,^ < |— r^^r^-g f* SSr + r*s R s R \loo^s + lir^^r^ 1 \\ OQ [\\Rs\\oo 



Ps||oo + 7] (80) 



where we used the fact that 1 1 1| 
mentioned in the lemma, 



+ (PsJoo + II^sJU + t), 

As and ||Z 7 |[ 00 = 1. Now if we apply the assumptions 



IA 



R\\oo,S R 



< 



2K S s{m + a) 
m(l - 2K SSr ) 



+ ( 1 + ^)( 1 + I r ^ r ^ 1 |oo 



S R S SS SS R ' S R S R \\\co 
7 = C37, 



(81) 



which proves part (a) of the Lemma. 

Now if we substitute (I77D in the equation from third row of (I76p . we have 



Z s c 



1 S^S 1 SS 



-T* SSa \ s + R s -E s -'y{Z y ) i 



S M S R 



As + Rs* - Es<: 



(82) 



Taking element-wise norm from above equation gives the following bound 



\Z 



<^ HIT -1 * "P* — t~>* ill \ I |||"p* "p* — 1 HI 

|oo,S^ S 1| 1 s c M S L SS 1 SS R - 1 S C M S R \\\od A S + P S C M S L SS III 00 



[ll-Rslloo + ll^slloo + 7] (83) 
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where we used the fact that \\Z. 



7 1| oo 



1. Applying assumption (A. 4) to above bound results 



ll^lloo,^ < (1 - a)X s + (2 - a) [\\R\ln + ||^||oo] + (1 - a)-y. 
Using assumptions stated in the Lemma, we have 



2 oo,S5, < 



< 



2K SS 



, 9r .1+-) 1-a + 2-q — + 1 

1 — 2Kss R V 771/ m 

4iW 1 + -) (1 - a) + (2 - a)— + (1 - a) 

V 771/ 771 

m — (m — l)a 4a 
4K S S J — + — + {l-a) 7<7, 

777, 777 



< 



a 



7 



(84) 



(85) 



where we used the bound on Kss R in assumption (A. 4) in the second inequality and the fact that 
a > in the third inequality. Final inequality is derived from assumption (A. 5) which finishes the 
proof of part (b). □ 



E.3 Control of Remainder 

In the following Lemma which is stated and proved in lemma 5 in [5], the argument Aj controls 
the remainder function behavior. 

Lemma 6. Suppose that the element-wise bound HAjHoq < u^ Md for some I > 1 holds. Then 



r{aj) = {jir^jfQj, 



M ) 



(86) 



-wise 



where Q := J2T=o(~ l)*(^Af 1 ^-j) k w ^ b° un d |||Q T '||| o — tzp Also, in terms of element- 
norm, we have 

||i2(Aj)||oo< j^i^l^H^Af ( 87 ) 



E.4 Control of Aj 

According to the primal-dual witness solutions construction, we have the error bounds on A j within 
the sets Sr and S C M such that || Aj^^ = A5 and HAjHoq^c = 0. In the following lemma, we 

propose some conditions to control the error HAjj^^. 
Lemma 7. Suppose that 



r :- 



/ L 



1 - 1 



:= 2K SSR Xs + 2K 8S (\\EU + t) < min j ^ , j , 

then we have the following element-wise bound for (Aj)^, 

IIAjHoc^ < r. 



(88) 



34 



The proof is within the same lines of Lemma 6 proof in [5] but with some modifications since the 
error HAjHoq is not zero and therefore the nonzero value \$ arises in the final result. Since 
the modified optimization program (|4ip is different with the modified program in [5], it is worth 
discussing about existing a unique solution for the modified optimization program (I4ip . This 
uniqueness can be shown with similar discussion presented in Appendix [B] for uniqueness of the 
solution of original program (139p . We only need to show that there is no problem in uniqueness by 
removing the off-diagonal constaraints for set S in the modified program. By Lagrangian duality, 
the l\ penalty term 7|| Jw ||i,off can De moved to constraints as || Jm ||i,off — f° r some bounded 
(7(7). Therefore, the off-diagonal entries in set S where the corresponding constraints were relaxed 
in the modified program are still bounded because of this i\ constraint. Hence, the modified 
program (|41|) has a unique solution. 

E.5 Spectral norm error bound on overall covariance matrix £ = J^ 1 — E# 

Corollary 3. Under the same assumptions (excluding (A. 6)) as Theorem^ with probability greater 
than 1 — l/p c , the overall covariance matrix estimate £ = Sjv/ — £_r satisfies spectral norm error 
bound 

|||S -E* I < (c 4 + ^C 3 yS f (p T ;n) + C 5 d 2 [5 f (p T ;n)] 2 . (90) 

Proof: We first bound the spectral norm errors for the Markov and residual covariance matrices 
"Em an d Along the same lines as Corollary 4 proof in [5], the spectral norm error |£m — ^a/II 
can be bounded as 

||£m-£mI < C 4 d5 f (p T ;n) + C 5 d 2 \5 f (p T ;n)} 2 , (91) 

where C 4 = 4(1 + ^)K SS K 2 M and C 5 = $(l + ^) 2 K 2 S K 3 M . 
The spectral norm error |||E_r — can be also bounded as 

^ ^ ^ nrn 

|Efl - T? R \ < |Sfl - SJel^ < d\\Z R - X* R \\oc < -C 3 d5 f (p T ;n), (92) 

where the first inequality is the property of spectral norm which is bounded by ^oo-operator norm, 
second inequality is a result of the fact that and T* R has at most d nonzero entries in each row 
(since Sr C Sm) and the last inequality is concluded from the upper bound on element-wise 
norm error on residual matrix estimation stated in part (a) of Theorem [2j 

Applying the above bounds to the overall covariance matrix estimation S = Ejvf — and using 
the triangular inequality for norms, the bound in (|90p is proven. □ 

F Proof of Corollary [2] 

Proof: The result in this corollary is a special case of general result in Theorem [2] when A* = 
and some minor modifications are considered in problem formulation. Note that, it is expressed in 
assumption (A.l) that the off-diagonal entries of exact Markov matrix are upper bounded by 
some positive A*. In order to extend the proof to the case of A* = (The case in this corollary), we 
need some minor modifications. First, the identifiability assumptions (A.0)-(A.3) can be ignored 
and instead it is assumed that the Markov part J^j (or equivalently S^f) is diagonal and the residual 
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part T,* R has only nonzero off-diagonal entries. Since the diagonal Markov matrix and off-diagonal 
residual matrix do not have any nonzero overlapping entries, it is natural that we do not require 
any more identifiability assumptions. Then, with these new assumptions, the set Sm is defined as 
Sm '■= Sr U \i = l,...,p} where Sr is defined the same as (fl~7|) and also set S is defined the 
same as (|18p which results that set S includes only diagonal entries. Thus, the off-diagonal entries 
belongs to sets Sr and S C M . Since T,* M is a diagonal matrix, all submatrices of T* which are indexed 
by sets Sr or S M are complete zero matrices. The result is that the terms which are bounded in 
the mutual incoherence condition (A. 4) are already zero and thus there is no need to consider those 
additional assumptions in the corollary. 

By making these changes in the problem formulation, the result in Corollary [2] can be proven within 
the same lines of general result proof in Theorem [2j It is only required to change the constraint on 
set Sr in the modified optimization program to (Jm) Sr = A sign ^(E^)^ J . □ 
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